This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.

Load libraries:

library(tidyverse)      # Data manipulation
library(data.table)     # Data reading
library(obisdi)         # Tools for data ingestion for OBIS
library(here)           # Get paths (important!)
library(arrow)          # To deal with parquet files

1 Read source data

The checklist is downloaded from FigShare. We use the obisdi function get_figshare to download the files and to obtain the metadata. Because the files are large, we add a check so that the data is downloaded only once, and we save the resulting metadata alongside it:

# Get the path to data/raw
raw_path <- here("data", "raw")

# See if files were already downloaded
lf <- list.files(raw_path)
if (!any(grepl("figshare", lf))) {
  fig_details <- get_figshare(article_id = 7854767, download_files = TRUE,
                              save_meta = TRUE, path = raw_path)
}

Following the download, the details of the dataset can be accessed from the file data/raw/figshare_metadata_20062023.csv.

Title: A fine-tuned global distribution dataset of marine forests
Authors: Jorge Assis, Eliza Fragkopoulou, Duarte Frade, João Neiva, André Oliveira, David Abecasis, Silvan Faugeron, Ester A. Serrão
Date (dmy format): 19/03/2020
DOI: 10.6084/m9.figshare.7854767.v1
URL: https://figshare.com/articles/dataset/A_fine-tuned_global_distribution_dataset_of_marine_forests/7854767

2 Preprocessing

First, we reduce the size of the raw files by converting them to the parquet format. We keep only the flagged file (databaseAll), which is the one that we will include in the OBIS database.

raw_files <- list.files(raw_path, full.names = TRUE)
# Keep only the flagged dataset (CSV or parquet) and the metadata file.
# A logical index is used because negative indexing with an empty grep()
# result would select no files at all.
keep <- grepl("databaseAll\\.csv|databaseAll\\.parquet|metadata", raw_files)
file.remove(raw_files[!keep])

# Run the conversion only the first time this document is knitted
if (any(grepl("databaseAll\\.csv", raw_files))) {
  flagged <- fread(file.path(raw_path, "databaseAll.csv"))
  write_parquet(flagged, file.path(raw_path, "databaseAll.parquet"))
  rm(flagged)
  file.remove(file.path(raw_path, "databaseAll.csv"))
}

Now we can load the parquet file containing the dataset we will work with.

dataset <- read_parquet(paste0(raw_path, "/databaseAll.parquet"))
head(dataset)

We filter the dataset to remove records that are already available on OBIS. To do so, we search the bibliographicCitation field, ignoring case, for “Ocean Biogeographic Information System” (the old name) and “Ocean Biodiversity Information System”, and drop the matching records.

dataset_filt <- dataset %>%
  mutate(proc_bibliographicCitation = tolower(bibliographicCitation)) %>%
  filter(!grepl("ocean biogeographic information system|ocean biodiversity information system", proc_bibliographicCitation)) %>%
  select(-proc_bibliographicCitation)
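
The same filtering idea can be tried on a few made-up records before running it on the full dataset. This is only a minimal sketch: the toy data frame and its citations are hypothetical, and it uses grepl with ignore.case = TRUE instead of the temporary lower-cased column, which should give the same result.

```r
library(dplyr)

# Hypothetical toy records; only bibliographicCitation matters here
toy <- data.frame(
  id = 1:4,
  bibliographicCitation = c(
    "Ocean Biogeographic Information System (old name)",
    "OCEAN BIODIVERSITY INFORMATION SYSTEM, accessed online",
    "Some herbarium record",
    NA
  )
)

obis_pattern <- "ocean biogeographic information system|ocean biodiversity information system"

# Drop records whose citation mentions OBIS, regardless of case.
# grepl() returns FALSE for NA, so records without a citation are kept.
toy_filt <- toy %>%
  filter(!grepl(obis_pattern, bibliographicCitation, ignore.case = TRUE))

toy_filt$id
```

Records with a missing citation are retained, because they cannot be identified as coming from OBIS.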

3 Darwin Core mapping

This dataset already follows the Darwin Core (DwC) standard, so no mapping is necessary. However, we need to separate the flags into a new table, which will contain the MeasurementOrFact records:

flags <- dataset_filt %>%
  select(id, starts_with("flag"))

Now we convert the flags object to the right format:

flags_conv <- flags %>%
  pivot_longer(cols = starts_with("flag"),
               names_to = "measurementType",
               values_to = "measurementValue") %>%
  mutate(measurementValue = as.numeric(measurementValue))
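
To make the reshaping concrete, here is a minimal sketch on a hypothetical two-record table. The flag column names are taken from this document, but the ids and the 0/1 values are invented for illustration and may not reflect how the flags are coded in the real dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per record, one column per flag
toy_flags <- data.frame(
  id = c("a", "b"),
  flagHumanCuratedDistribution = c(1L, 0L),
  flagMachineOnLand = c(0L, 0L),
  flagMachineSuitableLightBottom = c(1L, 1L)
)

# Long table: one row per record and flag
toy_long <- toy_flags %>%
  pivot_longer(cols = starts_with("flag"),
               names_to = "measurementType",
               values_to = "measurementValue") %>%
  mutate(measurementValue = as.numeric(measurementValue))

nrow(toy_long)  # 2 records x 3 flags = 6 rows
```

Each record now contributes one row per flag, with the flag name in measurementType and its value in measurementValue.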

We can check the conversion worked by tabulating the values:

cbind(data.frame(table(flags$flagHumanCuratedDistribution)),
               Freq_conv = data.frame(table(
                 flags_conv$measurementValue[flags_conv$measurementType == "flagHumanCuratedDistribution"]
               ))[,2])
cbind(data.frame(table(flags$flagMachineOnLand)),
               Freq_conv = data.frame(table(
                 flags_conv$measurementValue[flags_conv$measurementType == "flagMachineOnLand"]
               ))[,2])
cbind(data.frame(table(flags$flagMachineSuitableLightBottom)),
               Freq_conv = data.frame(table(
                 flags_conv$measurementValue[flags_conv$measurementType == "flagMachineSuitableLightBottom"]
               ))[,2])
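
The three checks above follow the same pattern, so they could also be written as a loop over the flag columns. The sketch below assumes a hypothetical helper, check_flags, and toy data; it is not part of obisdi, and it simply compares the frequency table of each wide column with that of the corresponding long rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in wide and long form
toy_flags <- data.frame(
  id = c("a", "b", "c"),
  flagHumanCuratedDistribution = c(1, 0, 1),
  flagMachineOnLand = c(0, 0, 1)
)
toy_long <- toy_flags %>%
  pivot_longer(cols = starts_with("flag"),
               names_to = "measurementType",
               values_to = "measurementValue")

# Hypothetical helper: TRUE for each flag whose value counts
# are identical in the wide and the long table
check_flags <- function(wide, long) {
  flag_cols <- grep("^flag", names(wide), value = TRUE)
  vapply(flag_cols, function(f) {
    wide_counts <- table(as.numeric(wide[[f]]), useNA = "ifany")
    long_counts <- table(long$measurementValue[long$measurementType == f],
                         useNA = "ifany")
    identical(names(wide_counts), names(long_counts)) &&
      identical(as.vector(wide_counts), as.vector(long_counts))
  }, logical(1))
}

flag_ok <- check_flags(toy_flags, toy_long)
flag_ok
```

A single stopifnot(all(flag_ok)) would then stop the knitting if any flag was altered by the conversion.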

That’s all we needed to do with the data for now.

4 Post-processing

As a final step, we remove the flag columns from the occurrence table, since they will be supplied to the IPT in a separate MeasurementOrFact file.

dataset_filt <- dataset_filt %>%
  select(-starts_with("flag"))

And those are the final objects:

dataset_filt
flags_conv

5 Export final files

We can then save the final files:

processed_path <- here("data", "processed")

write_csv(flags_conv, paste0(processed_path, "/extension.csv.gz"))

write_csv(dataset_filt, paste0(processed_path, "/occurrences.csv.gz"))

And we check if the files are saved:

list.files(processed_path)
## [1] "extension.csv.gz"   "occurrences.csv.gz"
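
Beyond listing the files, one could also read an exported file back and confirm that it has the expected shape. The following is a minimal sketch using a temporary folder and a hypothetical two-row table, so it does not touch data/processed; readr compresses on write and decompresses on read based on the .gz extension.

```r
library(readr)

# Temporary folder standing in for data/processed (assumption for this sketch)
tmp_dir <- tempfile("processed_")
dir.create(tmp_dir)

# Hypothetical stand-in for flags_conv
toy <- data.frame(
  id = c("a", "b"),
  measurementType = c("flagMachineOnLand", "flagMachineOnLand"),
  measurementValue = c(1, 0)
)

out_file <- file.path(tmp_dir, "extension.csv.gz")
write_csv(toy, out_file)

# Read the compressed file back and compare its shape with the original
toy_back <- read_csv(out_file, show_col_types = FALSE)
stopifnot(nrow(toy_back) == nrow(toy),
          identical(names(toy_back), names(toy)))
```

The same comparison could be applied to the real exports by replacing the toy table and the temporary path with flags_conv and processed_path.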


Dataset edited by the OBIS secretariat.

The harvesting of this data to OBIS is part of the MPA Europe project.

MPA Europe project has been approved under HORIZON-CL6-2021-BIODIV-01-12 — Improved science based maritime spatial planning and identification of marine protected areas.

Co-funded by the European Union. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or UK Research and Innovation. Neither the European Union nor the granting authority can be held responsible for them.


OBIS Data Ingestion | Ocean Biodiversity Information System (obis.org). For more information on how to contribute data to OBIS, see the OBIS manual. Created with the obisdi package.